We will re-use the colon cancer data in GSE33126
library(GEOquery)
url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/"
filenm <- "GSE33126_series_matrix.txt.gz"
if(!file.exists(filenm)) download.file(paste(url, filenm, sep=""), destfile=filenm)
colonData <- getGEO(filename=filenm)
## File stored at:
## /tmp/Rtmpf4Tli9/GPL6947.soft
colonData
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48803 features, 18 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM820516 GSM820517 ... GSM820533 (18 total)
## varLabels: title geo_accession ... data_row_count (31 total)
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48803
## total)
## fvarLabels: ID nuID ... GB_ACC (30 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6947
exprs(colonData) <- log2(exprs(colonData))
SampleGroup <- pData(colonData)$source_name_ch1
Patient <- pData(colonData)$characteristics_ch1.1
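Before going further, it is worth a quick sanity check (a suggestion, not part of the original workflow) that the sample annotation matches the expected paired tumour/normal design:

```r
## tabulate the sample groups, and cross-tabulate against patient IDs;
## each patient should contribute one tumour and one normal sample
table(SampleGroup)
table(SampleGroup, Patient)
```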
A first step in the analysis of microarray data is often to remove any uninformative probes. We can do this because typically only around 50% of genes will be expressed, and even fewer than this will be differentially expressed. Including such non-informative genes in a visualisation will obscure any biological differences that exist. The genefilter package contains a suite of tools for the filtering of microarray data. The varFilter function allows probes with low variance to be removed from the dataset. The metric used to decide which probes to remove is the Inter-Quartile Range (IQR), and by default half of the probes are removed. Both the filtering function and the cut-off can be specified by the user.
library(genefilter)
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:base':
##
## anyNA
dim(colonData)
## Features Samples
## 48803 18
varFiltered <- varFilter(colonData)
dim(varFiltered)
## Features Samples
## 24401 18
nrow(colonData) / nrow(varFiltered)
## Features
## 2.000041
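As noted above, both the filtering function and the cut-off can be changed from their defaults. A sketch with a stricter, arbitrarily chosen cutoff:

```r
## keep only the 25% most-variable probes; var.cutoff is
## interpreted as a quantile of the IQR values by default
strictFiltered <- varFilter(colonData, var.func = IQR, var.cutoff = 0.75)
dim(strictFiltered)
```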
The first step towards clustering the data is to construct a distance matrix. Each entry in this matrix is the pairwise distance between two samples. Note that for expression data we have to transpose the standard ExpressionSet format, as clustering is designed to work on rows rather than columns. The standard function to make a distance matrix in R is dist, which uses the Euclidean metric by default. As a distance matrix is symmetric, it has a more compact representation than the default matrix (values are not repeated unnecessarily), and clustering algorithms accept this representation as input.
N.B. to calculate the distances between samples, we have to transpose the expression matrix (e.g. using the function t). If we do not do this, R will try to compute distances between all genes, which may take a long time or exceed the available memory.
euc.dist <- dist(t(exprs(varFiltered)))
euc.dist
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820517 146.35076
## GSM820518 112.10192 145.53738
## GSM820519 127.02787 111.88327 128.89085
## GSM820520 103.88617 145.88415 93.26812 124.41043
## GSM820521 115.74149 109.59666 112.44568 77.60015 110.62329
## GSM820522 86.59678 140.12151 115.68936 131.12234 98.96944 117.26705
## GSM820523 118.07552 115.02450 139.86229 101.32561 134.43368 89.05971
## GSM820524 113.05465 135.64296 106.83091 126.25701 85.61877 117.19595
## GSM820525 142.82777 72.50357 145.43070 101.26515 138.18094 107.93645
## GSM820526 87.68842 160.67923 109.56682 126.28151 102.83077 114.59309
## GSM820527 102.51958 121.92622 119.62366 78.01042 110.89035 69.13310
## GSM820528 85.71746 132.44295 96.04791 115.36869 75.14097 103.00987
## GSM820529 107.07671 118.90261 120.08406 94.25470 114.46866 79.01849
## GSM820530 80.36987 151.46130 94.29680 123.97811 86.59315 103.45768
## GSM820531 108.26678 100.31635 115.03569 82.27479 112.72244 64.87145
## GSM820532 128.93571 95.33138 115.04522 106.11806 116.81059 102.43630
## GSM820533 106.21944 134.12719 115.78666 97.39843 115.37280 68.75089
## GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820517
## GSM820518
## GSM820519
## GSM820520
## GSM820521
## GSM820522
## GSM820523 105.62731
## GSM820524 85.85654 116.10403
## GSM820525 127.62848 99.13618 120.46802
## GSM820526 107.40340 134.03152 114.20701 154.74502
## GSM820527 106.32390 77.38321 115.56748 109.20371 104.05182
## GSM820528 89.80380 112.44894 90.45972 124.46834 97.24261 90.59615
## GSM820529 116.33941 83.80927 123.55735 107.46869 119.74686 59.39630
## GSM820530 97.19378 126.82149 107.83406 146.18764 87.43164 101.35185
## GSM820531 116.26826 91.51850 122.61883 103.07141 119.87820 70.55040
## GSM820532 124.92807 122.94258 116.57744 103.39879 137.24907 113.49486
## GSM820533 116.89886 93.17867 127.75562 130.21971 109.87582 64.16850
## GSM820528 GSM820529 GSM820530 GSM820531 GSM820532
## GSM820517
## GSM820518
## GSM820519
## GSM820520
## GSM820521
## GSM820522
## GSM820523
## GSM820524
## GSM820525
## GSM820526
## GSM820527
## GSM820528
## GSM820529 86.75871
## GSM820530 81.22235 103.68352
## GSM820531 96.07646 70.24274 99.39146
## GSM820532 105.14902 109.33036 120.61473 93.09175
## GSM820533 99.91990 68.56935 96.27636 73.49531 119.19071
For gene-expression data, it is common to use correlation as a distance metric rather than the Euclidean distance. You should make sure that you know the difference between the two metrics. The cor function can be used to calculate the correlation of columns in a matrix. Each row (or column) in the resulting matrix is the correlation of that sample with all other samples in the dataset. The matrix is symmetrical, and we can transform it into a distance matrix by subtracting the correlation values from 1. Hence, samples with a higher correlation have a smaller 'distance'.
corMat <- cor(exprs(varFiltered))
corMat
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820516 1.0000000 0.8582704 0.9171976 0.8937222 0.9286023 0.9112575
## GSM820517 0.8582704 1.0000000 0.8600812 0.9173538 0.8588151 0.9202012
## GSM820518 0.9171976 0.8600812 1.0000000 0.8907537 0.9425576 0.9163944
## GSM820519 0.8937222 0.9173538 0.8907537 1.0000000 0.8978211 0.9602207
## GSM820520 0.9286023 0.8588151 0.9425576 0.8978211 1.0000000 0.9187180
## GSM820521 0.9112575 0.9202012 0.9163944 0.9602207 0.9187180 1.0000000
## GSM820522 0.9503610 0.8696543 0.9115547 0.8864231 0.9349881 0.9085925
## GSM820523 0.9081569 0.9126298 0.8713382 0.9324985 0.8806670 0.9475820
## GSM820524 0.9156037 0.8781899 0.9247661 0.8949583 0.9514800 0.9089643
## GSM820525 0.8648318 0.9650722 0.8601046 0.9322218 0.8731558 0.9224893
## GSM820526 0.9493074 0.8293478 0.9209805 0.8950744 0.9301280 0.9131154
## GSM820527 0.9302064 0.9009531 0.9051381 0.9597308 0.9180982 0.9681213
## GSM820528 0.9511778 0.8829546 0.9388145 0.9117054 0.9623691 0.9291090
## GSM820529 0.9241083 0.9061505 0.9047172 0.9413361 0.9130390 0.9584997
## GSM820530 0.9573796 0.8482370 0.9414236 0.8987844 0.9504108 0.9291209
## GSM820531 0.9223823 0.9331725 0.9125297 0.9552921 0.9156400 0.9720189
## GSM820532 0.8897132 0.9395367 0.9123688 0.9254827 0.9092395 0.9300936
## GSM820533 0.9255052 0.8809038 0.9116292 0.9374967 0.9118986 0.9686817
## GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820516 0.9503610 0.9081569 0.9156037 0.8648318 0.9493074 0.9302064
## GSM820517 0.8696543 0.9126298 0.8781899 0.9650722 0.8293478 0.9009531
## GSM820518 0.9115547 0.8713382 0.9247661 0.8601046 0.9209805 0.9051381
## GSM820519 0.8864231 0.9324985 0.8949583 0.9322218 0.8950744 0.9597308
## GSM820520 0.9349881 0.8806670 0.9514800 0.8731558 0.9301280 0.9180982
## GSM820521 0.9085925 0.9475820 0.9089643 0.9224893 0.9131154 0.9681213
## GSM820522 1.0000000 0.9262929 0.9511794 0.8917089 0.9237271 0.9246435
## GSM820523 0.9262929 1.0000000 0.9111565 0.9350314 0.8817770 0.9603706
## GSM820524 0.9511794 0.9111565 1.0000000 0.9037952 0.9139681 0.9112436
## GSM820525 0.8917089 0.9350314 0.9037952 1.0000000 0.8415122 0.9204252
## GSM820526 0.9237271 0.8817770 0.9139681 0.8415122 1.0000000 0.9282016
## GSM820527 0.9246435 0.9603706 0.9112436 0.9204252 0.9282016 1.0000000
## GSM820528 0.9461775 0.9161095 0.9455812 0.8964670 0.9372344 0.9449786
## GSM820529 0.9101066 0.9536168 0.8988913 0.9232237 0.9051941 0.9764954
## GSM820530 0.9374851 0.8940675 0.9232363 0.8584350 0.9496136 0.9318156
## GSM820531 0.9101826 0.9446634 0.9003839 0.9293514 0.9049499 0.9668194
## GSM820532 0.8961047 0.8999407 0.9097986 0.9287621 0.8751776 0.9139221
## GSM820533 0.9094910 0.9427849 0.8921813 0.8875923 0.9203758 0.9726729
## GSM820528 GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
## GSM820516 0.9511778 0.9241083 0.9573796 0.9223823 0.8897132 0.9255052
## GSM820517 0.8829546 0.9061505 0.8482370 0.9331725 0.9395367 0.8809038
## GSM820518 0.9388145 0.9047172 0.9414236 0.9125297 0.9123688 0.9116292
## GSM820519 0.9117054 0.9413361 0.8987844 0.9552921 0.9254827 0.9374967
## GSM820520 0.9623691 0.9130390 0.9504108 0.9156400 0.9092395 0.9118986
## GSM820521 0.9291090 0.9584997 0.9291209 0.9720189 0.9300936 0.9686817
## GSM820522 0.9461775 0.9101066 0.9374851 0.9101826 0.8961047 0.9094910
## GSM820523 0.9161095 0.9536168 0.8940675 0.9446634 0.8999407 0.9427849
## GSM820524 0.9455812 0.8988913 0.9232363 0.9003839 0.9097986 0.8921813
## GSM820525 0.8964670 0.9232237 0.8584350 0.9293514 0.9287621 0.8875923
## GSM820526 0.9372344 0.9051941 0.9496136 0.9049499 0.8751776 0.9203758
## GSM820527 0.9449786 0.9764954 0.9318156 0.9668194 0.9139221 0.9726729
## GSM820528 1.0000000 0.9497681 0.9561975 0.9383683 0.9260060 0.9335861
## GSM820529 0.9497681 1.0000000 0.9288649 0.9672203 0.9204350 0.9688689
## GSM820530 0.9561975 0.9288649 1.0000000 0.9346090 0.9035232 0.9388144
## GSM820531 0.9383683 0.9672203 0.9346090 1.0000000 0.9422943 0.9642204
## GSM820532 0.9260060 0.9204350 0.9035232 0.9422943 1.0000000 0.9057118
## GSM820533 0.9335861 0.9688689 0.9388144 0.9642204 0.9057118 1.0000000
cor.dist <- as.dist(1 - corMat)
The values given by the cor function can be either positive or negative, depending on whether two samples are positively or negatively correlated. However, we may want our distance matrix to contain only values in the range 0 to 1, in which case we need to take the absolute values of the correlation matrix before converting to distances.
cor.dist <- as.dist(1 - abs(corMat))
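As an optional check (not part of the original workflow), all entries of the resulting distance object should now lie between 0 and 1:

```r
## absolute-correlation distances are bounded by 0 and 1
range(cor.dist)
```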
Hierarchical clustering is the method by which samples with similar profiles are grouped together, based on their distances. The method for grouping samples can be specified, with the default being to use complete linkage. Different linkage methods can affect the properties of the clustering output; e.g. the 'ward' method tends to produce smaller, compact clusters. A popular way of displaying the results of clustering is a dendrogram, which arranges the samples on the x-axis and shows the distances between samples on the y-axis. It is important to note that the distance of samples along the x-axis has no meaning. The fact that two samples are displayed next to each other does not mean that they are "closest". One has to use the y-axis to determine their distance.
clust.euclid <- hclust(euc.dist)
clust.cor <- hclust(cor.dist)
par(mfrow = c(1, 2))
plot(clust.euclid)
plot(clust.cor)
It is helpful to think of the dendrogram as a mobile that you might see hanging in a child’s nursery.
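We can demonstrate this point directly: like the arms of a mobile, the branches of a dendrogram can be flipped without changing the tree itself. The sketch below uses base R's dendrogram class (an optional illustration, not part of the original workflow):

```r
## reversing the branch order changes the left-to-right layout,
## but every merge height (the y-axis) is unchanged
dend <- as.dendrogram(clust.cor)
par(mfrow = c(1, 2))
plot(dend, main = "original")
plot(rev(dend), main = "branches reversed")
```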
The default plotting for a dendrogram labels the “leaves” with the column names from the input matrix, in our case the sample names from GEO. This may make the interpretation of the dendrogram difficult, as it may not be obvious which sample group each sample belongs to. We can alter the appearance of the dendrogram so that sample groups appear in the labels.
par(mfrow = c(1, 2))
plot(clust.euclid, labels = SampleGroup)
plot(clust.cor, labels = SampleGroup)
The WGCNA package in Bioconductor provides methods for finding clusters of correlated genes, which we will not be looking at in this tutorial. However, the package is of interest as it provides other visualisation methods for dendrograms, which allow colours to be overlaid to distinguish sample groups.
library(WGCNA)
## ==========================================================================
## *
## * Package WGCNA 1.49 loaded.
## *
## * Important note: It appears that your system supports multi-threading,
## * but it is not enabled within WGCNA in R.
## * To allow multi-threading within WGCNA with all available cores, use
## *
## * allowWGCNAThreads()
## *
## * within R. Use disableWGCNAThreads() to disable threading if necessary.
## * Alternatively, set the following environment variable on your system:
## *
## * ALLOW_WGCNA_THREADS=<number_of_processors>
## *
## * for example
## *
## * ALLOW_WGCNA_THREADS=8
## *
## * To set the environment variable in linux bash shell, type
## *
## * export ALLOW_WGCNA_THREADS=8
## *
## * before running R. Other operating systems or shells will
## * have a similar command to achieve the same aim.
## *
## ==========================================================================
groupColours <- as.factor(SampleGroup)
levels(groupColours) <- c("yellow", "blue")
plotDendroAndColors(clust.euclid, colors = as.character(groupColours))
If we want to interpret the data presented in a clustering analysis, we need a way of extracting which samples are grouped together, or of determining the optimal grouping of samples. One intuitive way of assigning groups is to 'cut' the dendrogram at a particular height on the y-axis. We can do this manually on the plot, or use the cutree function to return the labels of samples that belong to the same group when the dendrogram is cut at the specified height, h. Alternatively, we can specify how many groups, k, we want to create.
library(cluster)
par(mfrow = c(1, 1))
plot(clust.cor)
abline(h = 0.12, col = "red")
cutree(clust.cor, h = 0.12)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522
## 1 2 1 2 1 2 1
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529
## 2 1 2 1 2 1 2
## GSM820530 GSM820531 GSM820532 GSM820533
## 1 2 2 2
cutree(clust.cor, k = 2)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522
## 1 2 1 2 1 2 1
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529
## 2 1 2 1 2 1 2
## GSM820530 GSM820531 GSM820532 GSM820533
## 1 2 2 2
table(cutree(clust.cor, k = 3), SampleGroup)
## SampleGroup
## normal tumor
## 1 0 8
## 2 2 1
## 3 7 0
A silhouette plot can be used to choose the optimal number of clusters. For each sample, we calculate a value that quantifies how well it 'fits' the cluster that it has been assigned to. If the value is around 1, then the sample closely fits other samples in the same cluster. However, if the value is around 0, the sample could belong to another cluster. In the silhouette plot, the values for each cluster are plotted together and ordered from largest to smallest. The number of samples belonging to each group is also displayed.
par(mfrow = c(2, 2))
plot(silhouette(cutree(clust.cor, k = 2), cor.dist),
     col = "red", main = paste("k=", 2))
plot(silhouette(cutree(clust.cor, k = 3), cor.dist),
     col = "red", main = paste("k=", 3))
plot(silhouette(cutree(clust.cor, k = 4), cor.dist),
     col = "red", main = paste("k=", 4))
plot(silhouette(cutree(clust.cor, k = 5), cor.dist),
     col = "red", main = paste("k=", 5))
If we have prior knowledge of how many clusters to expect, we can run the clustering in a supervised manner. The Partitioning Around Medoids (PAM) method can be used to group samples into k clusters.
pam.clus <- pam(euc.dist, k = 2)
clusplot (pam.clus)
pam.clus$clustering
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522
## 1 2 1 2 1 2 1
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529
## 2 1 2 1 2 1 2
## GSM820530 GSM820531 GSM820532 GSM820533
## 1 2 2 2
table(pam.clus$clustering , SampleGroup)
## SampleGroup
## normal tumor
## 1 0 8
## 2 9 1
A heatmap is often used to visualise differences between samples. Each row represents a gene, each column is a sample, and coloured cells indicate the expression levels of genes. Both samples and genes with similar expression profiles are clustered together. By default, Euclidean distances are used with complete-linkage clustering. Drawing a heatmap in R uses a lot of memory and can take a long time, so reducing the amount of data to be plotted is usually recommended. Including too many non-informative genes can also make it difficult to spot patterns. Typically, data are filtered to include the genes which tell us the most about the biological variation. In an unsupervised setting, the selection of such genes is done without using prior knowledge about the sample groupings.
IQRs <- apply(exprs(varFiltered), 1, IQR)
highVarGenes <- order(IQRs, decreasing = TRUE)[1:100]
Symbols <- as.character(fData(colonData)$Symbol[highVarGenes])
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]),
        labCol = SampleGroup, labRow = Symbols)
The default options for the heatmap are to cluster both the genes (rows) and samples (columns). However, sometimes we might want to specify a particular order. For example, we might want to order the columns according to sample groups. We can do this by re-ordering the input matrix manually and setting the Colv argument to NA. This tells the heatmap function not to cluster the columns.
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]),
        labCol = SampleGroup, labRow = Symbols, Colv = NA)
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, order(SampleGroup)]),
        labCol = SampleGroup[order(SampleGroup)], labRow = Symbols, Colv = NA)
Alternatively, a pre-calculated dendrogram could be used.
clus.ward <- hclust(cor.dist, method = "ward.D")
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]),
        Colv = as.dendrogram(clus.ward), labCol = SampleGroup)
The heatmap function can be customised in many ways to make the output more informative. For example, the labRow and labCol parameters can be used to give labels to the rows (genes) and columns (samples) of the heatmap. Similarly, ColSideColors and RowSideColors give coloured labels, often used to indicate different groups which are known in advance. See the help page for heatmap for more details.
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]),
        labCol = Patient, ColSideColors = as.character(groupColours),
        labRow = Symbols)
The colours used to display the gene expression values can also be modified. For this, we can use the RColorBrewer package which has functions for creating pre-defined palettes. The function display.brewer.all can be used to display the palettes available through this package.
You should avoid using the traditional red / green colour scheme as it may be difficult for people with colour-blindness to interpret!
library(RColorBrewer)
display.brewer.all()
hmcol <- brewer.pal(11, "RdBu")
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]),
        ColSideColors = as.character(groupColours), labRow = Symbols,
        col = hmcol)
A popular use for heatmaps is to take an existing gene list (e.g. genes found to be significant in a previous study, or genes belonging to a particular pathway) and produce an image of how they cluster the data for exploratory purposes. This can be achieved by selecting the correct rows in the data matrix.
tests <- rowttests(exprs(varFiltered), as.factor(SampleGroup))
N <- 50
myGenes <- rownames(tests)[order(tests$p.value)][1:N]
symbols <- fData(varFiltered)[myGenes,"Symbol"]
heatmap(as.matrix(exprs(varFiltered)[myGenes, ]),
        ColSideColors = as.character(groupColours), labCol = Patient,
        labRow = symbols, col = hmcol)
One drawback of the standard heatmap function is that it only allows one “track” of colours below the dendrogram. We might wish to display various sample groupings using this feature. The heatmap.plus package allows us to do just this.
library(heatmap.plus)
colourMatrix <- matrix(nrow=length(SampleGroup),ncol=2)
patientCol <- rep(rainbow(n=length(unique(Patient))),each=2)
colourMatrix[,1] <- as.character(groupColours)
colourMatrix[,2] <- patientCol
heatmap.plus(as.matrix(exprs(varFiltered)[highVarGenes, ]),
             ColSideColors = as.matrix(colourMatrix), labRow = Symbols,
             col = hmcol)
Another alternative is provided by the gplots package. The heatmap.2 function can be used in the same fashion as heatmap. The plots produced include a colour legend for the cells in the heatmap. By default, a density plot of each column is also produced.
library(gplots)
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
heatmap.2(as.matrix(exprs(varFiltered)[highVarGenes, ]),
          ColSideColors = as.character(groupColours), labRow = Symbols,
          col = hmcol)
We can turn off the column density plot if we wish.
heatmap.2(as.matrix(exprs(varFiltered)[highVarGenes, ]),
          ColSideColors = as.character(groupColours), labRow = Symbols,
          col = hmcol, trace = "none")
Principal components analysis (PCA) is a data reduction technique that allows us to simplify multidimensional data sets to 2 or 3 dimensions for plotting purposes, and to identify which factors explain the most variability in the data. We can use the prcomp function to perform a PCA, supplying it with the expression matrix itself (not a distance matrix). The resulting object contains information on the proportion of variance that is 'explained' by each component. Ideally, we want to see that the majority of the variance is explained by the first 2 or 3 components, and that these components are associated with a biological factor.
pca <- prcomp(exprs(varFiltered))
plot(pca)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 7.171 1.13645 0.84320 0.6838 0.53334 0.50432
## Proportion of Variance 0.924 0.02321 0.01278 0.0084 0.00511 0.00457
## Cumulative Proportion 0.924 0.94724 0.96002 0.9684 0.97353 0.97810
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.44891 0.40378 0.38232 0.36830 0.32469 0.30894
## Proportion of Variance 0.00362 0.00293 0.00263 0.00244 0.00189 0.00172
## Cumulative Proportion 0.98172 0.98465 0.98728 0.98972 0.99161 0.99333
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.27589 0.27127 0.25520 0.24435 0.22922 0.2104
## Proportion of Variance 0.00137 0.00132 0.00117 0.00107 0.00094 0.0008
## Cumulative Proportion 0.99469 0.99602 0.99719 0.99826 0.99920 1.0000
The variable loadings are given by the $rotation matrix, which (because we ran prcomp with genes as rows and samples as columns) has one row for each sample and one column for each principal component. The principal components are ordered according to the amount of variance explained. The actual values of a particular component have no meaning on their own, but their relative values can be used to inform us about relationships between samples.
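A common follow-up (a sketch, not part of the original workflow) is to plot the samples on the first two components, coloured by sample group; as prcomp was run with genes as rows, the per-sample coordinates are in pca$rotation:

```r
## scatter plot of samples on PC1 vs PC2, coloured by sample group
## (groupColours was defined earlier: yellow / blue)
par(mfrow = c(1, 1))
plot(pca$rotation[, 1], pca$rotation[, 2],
     pch = 16, col = as.character(groupColours),
     xlab = "PC1", ylab = "PC2")
text(pca$rotation[, 1], pca$rotation[, 2],
     labels = SampleGroup, pos = 3, cex = 0.6)
```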
The Bioconductor project has a collection of example datasets. Often these are used as examples to illustrate a particular package or functionality, or to accompany the analysis presented in a publication. For example, several datasets are presented to accompany the genefu package which has functions useful for the classification of breast cancer patients based on expression profiles. An experimental dataset can be installed and loaded as with any other Bioconductor package. The data itself is saved as an object in the package. You will need to see the documentation for the package to find out the relevant object name. The full list of datasets available through Bioconductor can be found here
library(breastCancerVDX)
library(breastCancerTRANSBIG)
data(vdx)
data(transbig)
dim(vdx)
## Features Samples
## 22283 344
dim(transbig)
## Features Samples
## 22283 198
annotation(vdx)
## [1] "hgu133a"
annotation(transbig)
## [1] "hgu133a"
If we want any classifiers to be reproducible and applicable to other datasets, it is sensible to exclude probes that do not have sufficient annotation from the analysis. For this, we can use the genefilter package as before. The nsFilter function performs this annotation-based filtering as well as variance filtering. The output of the function includes details of how many probes were removed at each stage of the filtering.
library(genefilter)
vdx.filt <- nsFilter(vdx)
##
## Attaching package: 'IRanges'
##
## The following object is masked from 'package:gplots':
##
## space
vdx.filt
## $eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 6221 features, 344 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: 206797_at 203440_at ... 201130_s_at (6221 total)
## fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 17420468
## Annotation: hgu133a
##
## $filter.log
## $filter.log$numDupsRemoved
## [1] 7408
##
## $filter.log$numLowVar
## [1] 6221
##
## $filter.log$numRemoved.ENTREZID
## [1] 2423
##
## $filter.log$feature.exclude
## [1] 10
vdx.filt <- vdx.filt[[1]]
Format the vdx data for pamr, and train a classifier to predict ER status. For extra clarity in the results, it might be useful to rename the binary ER status used in the data package to something more descriptive.
library(pamr)
## Loading required package: survival
dat <- exprs(vdx.filt)
gN <- as.character(fData(vdx.filt)$Gene.symbol)
gI <- featureNames(vdx.filt)
sI <- sampleNames(vdx.filt)
erStatus <- pData(vdx)$er
erStatus <- gsub(0, "ER -", erStatus)
erStatus <- gsub(1, "ER +", erStatus)
Fitting the model
train.dat <- list(x = dat, y = erStatus, genenames = gN,
                  geneid = gI, sampleid = sI)
model <- pamr.train(train.dat, n.threshold = 100)
## 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model
## Call:
## pamr.train(data = train.dat, n.threshold = 100)
## threshold nonzero errors
## 1 0.000 6221 41
## 2 0.130 5803 41
## 3 0.260 5365 39
## 4 0.390 4926 39
## 5 0.519 4501 39
## 6 0.649 4157 38
## 7 0.779 3789 38
## 8 0.909 3440 38
## 9 1.039 3144 38
## 10 1.169 2884 36
## 11 1.299 2624 34
## 12 1.428 2377 34
## 13 1.558 2197 34
## 14 1.688 2000 33
## 15 1.818 1823 34
## 16 1.948 1663 33
## 17 2.078 1538 33
## 18 2.208 1405 34
## 19 2.337 1281 33
## 20 2.467 1164 33
## 21 2.597 1067 32
## 22 2.727 994 32
## 23 2.857 910 33
## 24 2.987 823 33
## 25 3.117 753 33
## 26 3.246 697 32
## 27 3.376 643 31
## 28 3.506 577 32
## 29 3.636 522 33
## 30 3.766 460 33
## 31 3.896 398 33
## 32 4.026 360 33
## 33 4.155 319 33
## 34 4.285 284 33
## 35 4.415 246 33
## 36 4.545 219 33
## 37 4.675 191 33
## 38 4.805 170 34
## 39 4.935 155 34
## 40 5.064 132 34
## 41 5.194 117 35
## 42 5.324 103 35
## 43 5.454 89 36
## 44 5.584 77 36
## 45 5.714 68 36
## 46 5.844 63 36
## 47 5.973 57 36
## 48 6.103 54 37
## 49 6.233 49 37
## 50 6.363 44 37
## 51 6.493 41 37
## 52 6.623 37 37
## 53 6.753 34 36
## 54 6.882 32 37
## 55 7.012 28 37
## 56 7.142 27 37
## 57 7.272 25 38
## 58 7.402 23 39
## 59 7.532 19 41
## 60 7.661 17 42
## 61 7.791 17 42
## 62 7.921 17 43
## 63 8.051 17 43
## 64 8.181 16 43
## 65 8.311 15 44
## 66 8.441 13 43
## 67 8.570 11 48
## 68 8.700 8 48
## 69 8.830 7 48
## 70 8.960 6 51
## 71 9.090 6 54
## 72 9.220 6 56
## 73 9.350 5 57
## 74 9.479 4 58
## 75 9.609 4 59
## 76 9.739 4 63
## 77 9.869 3 66
## 78 9.999 3 69
## 79 10.129 3 73
## 80 10.259 3 79
## 81 10.388 3 100
## 82 10.518 3 118
## 83 10.648 2 134
## 84 10.778 2 135
## 85 10.908 2 135
## 86 11.038 1 135
## 87 11.168 1 135
## 88 11.297 1 135
## 89 11.427 1 135
## 90 11.557 1 135
## 91 11.687 1 135
## 92 11.817 1 135
## 93 11.947 1 135
## 94 12.077 1 135
## 95 12.206 1 135
## 96 12.336 1 135
## 97 12.466 1 135
## 98 12.596 1 135
## 99 12.726 1 135
## 100 12.856 0 135
We can perform cross-validation using the pamr.cv function. Printing the output of this function shows a table of how many genes were used at each threshold, and the number of classification errors. Both these values need to be taken into account when choosing a suitable threshold. The pamr.plotcv function can assist with this by producing a diagnostic plot which shows how the error changes with the number of genes. In the plot produced by this function there are two panels; the top one shows the errors in the whole dataset and the bottom one considers each class separately. In each panel, the x-axis corresponds to the threshold (and number of genes at each threshold) whereas the y-axis is the number of misclassifications.
model.cv <- pamr.cv(model, train.dat, nfold = 10)
## 12Fold 1 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 2 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 3 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 4 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 5 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 6 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 7 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 8 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 9 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 10 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model.cv
## Call:
## pamr.cv(fit = model, data = train.dat, nfold = 10)
## threshold nonzero errors
## 1 0.000 6221 44
## 2 0.130 5803 45
## 3 0.260 5365 45
## 4 0.390 4926 45
## 5 0.519 4501 44
## 6 0.649 4157 44
## 7 0.779 3789 43
## 8 0.909 3440 42
## 9 1.039 3144 40
## 10 1.169 2884 40
## 11 1.299 2624 40
## 12 1.428 2377 39
## 13 1.558 2197 36
## 14 1.688 2000 37
## 15 1.818 1823 39
## 16 1.948 1663 39
## 17 2.078 1538 40
## 18 2.208 1405 40
## 19 2.337 1281 40
## 20 2.467 1164 39
## 21 2.597 1067 39
## 22 2.727 994 37
## 23 2.857 910 36
## 24 2.987 823 35
## 25 3.117 753 36
## 26 3.246 697 36
## 27 3.376 643 36
## 28 3.506 577 36
## 29 3.636 522 36
## 30 3.766 460 35
## 31 3.896 398 35
## 32 4.026 360 35
## 33 4.155 319 35
## 34 4.285 284 34
## 35 4.415 246 34
## 36 4.545 219 34
## 37 4.675 191 34
## 38 4.805 170 34
## 39 4.935 155 34
## 40 5.064 132 34
## 41 5.194 117 34
## 42 5.324 103 36
## 43 5.454 89 36
## 44 5.584 77 36
## 45 5.714 68 36
## 46 5.844 63 37
## 47 5.973 57 37
## 48 6.103 54 38
## 49 6.233 49 38
## 50 6.363 44 38
## 51 6.493 41 38
## 52 6.623 37 38
## 53 6.753 34 38
## 54 6.882 32 39
## 55 7.012 28 37
## 56 7.142 27 37
## 57 7.272 25 39
## 58 7.402 23 39
## 59 7.532 19 40
## 60 7.661 17 41
## 61 7.791 17 41
## 62 7.921 17 42
## 63 8.051 17 43
## 64 8.181 16 44
## 65 8.311 15 47
## 66 8.441 13 47
## 67 8.570 11 48
## 68 8.700 8 50
## 69 8.830 7 51
## 70 8.960 6 53
## 71 9.090 6 53
## 72 9.220 6 56
## 73 9.350 5 59
## 74 9.479 4 59
## 75 9.609 4 61
## 76 9.739 4 64
## 77 9.869 3 65
## 78 9.999 3 73
## 79 10.129 3 84
## 80 10.259 3 100
## 81 10.388 3 116
## 82 10.518 3 123
## 83 10.648 2 127
## 84 10.778 2 133
## 85 10.908 2 134
## 86 11.038 1 135
## 87 11.168 1 135
## 88 11.297 1 134
## 89 11.427 1 135
## 90 11.557 1 135
## 91 11.687 1 135
## 92 11.817 1 135
## 93 11.947 1 135
## 94 12.077 1 135
## 95 12.206 1 135
## 96 12.336 1 135
## 97 12.466 1 135
## 98 12.596 1 135
## 99 12.726 1 135
## 100 12.856 0 135
pamr.plotcv(model.cv)
In the following sections, feel free to experiment with different values of the threshold (which we will call Delta). The misclassifications can easily be visualised as a 'confusion table'. This simply tabulates the classes assigned to each sample by the classifier against the original label of the sample; e.g. misclassifications are samples that we thought were ER+ but have been assigned to the ER- group by the classifier, or ER- samples assigned as ER+ by the classifier.
Delta <- 8
pamr.confusion(model.cv, Delta)
## ER - ER + Class Error rate
## ER - 106 29 0.21481481
## ER + 14 195 0.06698565
## Overall error rate= 0.125
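The logic behind a confusion table can be reproduced with base R alone. The sketch below uses made-up true and predicted labels (not the vdx data) to show how the per-class and overall error rates in the output above are derived.

```r
## Toy example with hypothetical class labels (not the vdx data)
trueClass <- factor(c("ER-", "ER-", "ER-", "ER+", "ER+", "ER+", "ER+", "ER+"))
predClass <- factor(c("ER-", "ER-", "ER+", "ER+", "ER+", "ER+", "ER-", "ER+"))

## cross-tabulate true class against predicted class
confusion <- table(trueClass, predClass)
confusion

## per-class error rate: misclassified samples / total samples in that class
classError <- 1 - diag(confusion) / rowSums(confusion)
classError

## overall error rate: all off-diagonal counts / total samples
overallError <- 1 - sum(diag(confusion)) / sum(confusion)
overallError
```

Correct calls fall on the diagonal of the table, so the error rates are simply the off-diagonal proportions, exactly as pamr.confusion reports them.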
A visual representation of the class separation can be obtained using the pamr.plotcvprob function. For each sample there are two circles representing the probability of that sample being classified ER- (red) or ER+ (green).
pamr.plotcvprob(model, train.dat, Delta)
There are a couple of ways of extracting the details of the genes that have been used in the classifier. We can list their names using the pamr.listgenes function, which in our case just returns the microarray probe names. We can, however, use these IDs to query the featureData stored with the original vdx object. We can also plot the expression values for each gene, coloured according to the class label.
pamr.listgenes(model, train.dat, Delta)
## id ER --score ER +-score
## [1,] 205225_at -0.3257 0.2104
## [2,] 209173_at -0.202 0.1305
## [3,] 209603_at -0.1753 0.1133
## [4,] 204508_s_at -0.1184 0.0765
## [5,] 205009_at -0.0987 0.0638
## [6,] 214440_at -0.083 0.0536
## [7,] 205186_at -0.0583 0.0376
## [8,] 219197_s_at -0.0507 0.0327
## [9,] 209459_s_at -0.0453 0.0292
## [10,] 212956_at -0.0425 0.0274
## [11,] 218976_at -0.0402 0.026
## [12,] 203929_s_at -0.035 0.0226
## [13,] 204623_at -0.0325 0.021
## [14,] 215729_s_at 0.0295 -0.0191
## [15,] 209800_at 0.0211 -0.0136
## [16,] 218211_s_at -0.018 0.0116
## [17,] 205862_at -0.0109 0.0071
classifierGenes <- pamr.listgenes(model, train.dat, Delta)[, 1]
pamr.geneplot(model, train.dat, Delta)
You may get an error message Error in plot.new(): Figure margins too large when trying to produce the gene plot. If this occurs, try increasing the size of your plotting window, or decrease the number of genes by increasing the threshold. Alternatively, the following code will write the plots to a pdf.
pdf("classifierProfiles.pdf")
for (i in 1:length(classifierGenes)) {
  Symbol <- fData(vdx.filt)[classifierGenes[i], "Gene.symbol"]
  boxplot(exprs(vdx.filt)[classifierGenes[i], ] ~ erStatus,
          main = Symbol)
}
dev.off()
## png
## 2
Use the genes identified by the classifier to produce a heatmap to confirm that they separate the samples as expected.
symbols <- fData(vdx.filt)[classifierGenes, "Gene.symbol"]
heatmap(exprs(vdx.filt)[classifierGenes, ], labRow = symbols)
We can now test the classifier on an external dataset. We choose the transbig dataset for simplicity, as it was generated on the same microarray platform.
library(breastCancerTRANSBIG)
data(transbig)
pData(transbig)[1:4, ]
## samplename dataset series id filename size age er grade pgr her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue t.os e.os
## VDXGUYU_4002 VDXGUYU_4002 TRANSBIG VDXGUYU 4002 GSM177885.CEL.gz 3.0 57 0 3 NA NA NA 1 723 0 723 1 0 1 937 1
## VDXGUYU_4008 VDXGUYU_4008 TRANSBIG VDXGUYU 4008 GSM177886.CEL.gz 3.0 57 1 3 NA NA NA 0 6591 0 183 1 0 1 6591 0
## VDXGUYU_4011 VDXGUYU_4011 TRANSBIG VDXGUYU 4011 GSM177887.CEL.gz 2.5 48 0 3 NA NA NA 1 524 0 524 1 0 1 922 1
## VDXGUYU_4014 VDXGUYU_4014 TRANSBIG VDXGUYU 4014 GSM177888.CEL.gz 1.8 42 1 3 NA NA NA 1 6255 0 2192 1 0 1 6255 1
transbig.filt <- transbig[featureNames(vdx.filt), ]
predClass <- pamr.predict(model, exprs(transbig.filt), Delta)
table(predClass, pData(transbig)$er)
##
## predClass 0 1
## ER - 52 11
## ER + 12 123
boxplot(pamr.predict(model, exprs(transbig.filt), Delta,
                     type = "posterior")[, 1] ~ pData(transbig)$er)
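Before a classifier trained on one dataset can be applied to another, the two feature sets must be aligned: the test data must be restricted to, and re-ordered by, the training features. The subsetting step above does this for ExpressionSets; a minimal base-R sketch of the same idea, using small made-up matrices (hypothetical probe IDs, random values), is:

```r
## Hypothetical training and test matrices with probe IDs as row names
train <- matrix(rnorm(12), nrow = 4,
                dimnames = list(c("p1", "p2", "p3", "p4"), paste0("s", 1:3)))
test  <- matrix(rnorm(15), nrow = 5,
                dimnames = list(c("p5", "p3", "p1", "p2", "p4"), paste0("t", 1:3)))

## restrict and re-order the test data to match the training features
test.matched <- test[rownames(train), ]

## rows now line up with the training matrix
stopifnot(identical(rownames(test.matched), rownames(train)))
```

Row-name subsetting both drops features absent from the training set (p5 here) and puts the remainder in training order, which is what prediction functions implicitly require.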
Make a heatmap of the transbig data using the genes involved in the vdx classifier.
erLab <- as.factor(pData(transbig)$er)
levels(erLab) <- c("blue", "yellow")
heatmap(exprs(transbig.filt)[classifierGenes, ], labRow = symbols,
        ColSideColors = as.character(erLab))
An attractive feature of the vdx dataset is that it includes survival data for each breast cancer patient. We are not explicitly covering survival analysis in this course, but for your reference, here are the commands to create survival curves when patients are grouped by ER status and tumour grade.
library(survival)
par(mfrow = c(1, 2))
plot(survfit(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
       pData(vdx)$er), col = c("cyan", "salmon"))
plot(survfit(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
       pData(vdx)$grade), col = c("blue", "yellow", "orange"))
survdiff(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
           pData(vdx)$er)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
## pData(vdx)$er)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$er=0 135 38 45.5 1.244 2.04
## pData(vdx)$er=1 209 80 72.5 0.781 2.04
##
## Chisq= 2 on 1 degrees of freedom, p= 0.154
survdiff(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
           pData(vdx)$grade)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
## pData(vdx)$grade)
##
## n=197, 147 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$grade=1 7 0 3.06 3.06 3.21
## pData(vdx)$grade=2 42 10 16.69 2.68 3.53
## pData(vdx)$grade=3 148 61 51.26 1.85 6.72
##
## Chisq= 7.7 on 2 degrees of freedom, p= 0.0218
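As a self-contained illustration of the survdiff calls above, the following toy example runs a log-rank test on simulated follow-up times, event indicators and a two-level grouping variable; all of the values are hypothetical and unrelated to the vdx data.

```r
library(survival)

## Hypothetical follow-up times (days), event indicators (1 = event,
## 0 = censored) and a two-level grouping variable
time  <- c(100, 250, 400, 120, 800, 150, 900, 1000, 300, 600)
event <- c(1,   1,   0,   1,   0,   1,   0,   1,    1,   0)
group <- factor(rep(c("A", "B"), each = 5))

## log-rank test: does survival differ between the two groups?
fit <- survdiff(Surv(time, event) ~ group)
fit$chisq  ## chi-squared statistic on 1 degree of freedom
```

The Surv object pairs each time with its censoring status, and survdiff compares the observed and expected event counts per group, exactly as in the vdx output above.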